Skip to content

Fix flaky testWritesRejectedDueToTimeLagBreach#20869

Merged
cwperks merged 1 commit intoopensearch-project:mainfrom
andrross:flaky-rsbit
Mar 20, 2026
Merged

Fix flaky testWritesRejectedDueToTimeLagBreach#20869
cwperks merged 1 commit intoopensearch-project:mainfrom
andrross:flaky-rsbit

Conversation

@andrross
Copy link
Member

The test assumed that a fixed number of index+refresh iterations during the failure phase would always accumulate enough time lag to trigger backpressure rejection. If the test machine is slow then upload time moving average might be inflated, raising the dynamic threshold (avgUploadTime * 10) beyond what the time lag can reach in only 3 iterations.

Replace the fixed-iteration assertThrows with an assertBusy loop that keeps indexing until the backpressure rejection occurs or the 30-second timeout is reached.

Related Issues

Resolves #17936

Check List

  • Functionality includes testing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@andrross andrross requested a review from a team as a code owner March 14, 2026 01:33
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Mar 14, 2026
@github-actions
Copy link
Contributor

github-actions bot commented Mar 14, 2026

PR Reviewer Guide 🔍

(Review updated until commit 0f4a2f1)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Silent Success

Inside the assertBusy loop, when indexDocAndRefresh does NOT throw an exception, the code calls fail(...). However, assertBusy retries on AssertionError, so calling fail() (which throws AssertionError) will cause the loop to keep retrying — this is the intended behavior. But if the exception is caught and exRef is set, the method returns normally, which means assertBusy considers that iteration successful and stops. This is correct. However, if indexDocAndRefresh throws a different (non-OpenSearchRejectedExecutionException) exception, it will propagate out of assertBusy unexpectedly. Consider catching a broader exception or adding explicit handling for unexpected exceptions.

assertBusy(() -> {
    try {
        indexDocAndRefresh(onFailureSource, 1);
    } catch (OpenSearchRejectedExecutionException e) {
        exRef.set(e);
        return;
    }
    fail("Expected OpenSearchRejectedExecutionException to be thrown");
}, 30, TimeUnit.SECONDS);
Null Reference

After assertBusy completes, exRef.get() is used directly without a null check. If assertBusy somehow exits without the exception being set (e.g., due to a timeout where the last iteration succeeded without throwing), ex would be null, causing a NullPointerException on the subsequent assertTrue calls. An explicit assertion that exRef.get() is not null would make the failure message clearer.

OpenSearchRejectedExecutionException ex = exRef.get();
assertTrue(ex.getMessage().contains("rejected execution on primary shard"));
assertTrue(ex.getMessage().contains(breachMode));

@github-actions
Copy link
Contributor

github-actions bot commented Mar 14, 2026

PR Code Suggestions ✨

Latest suggestions up to 0f4a2f1

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard against null reference after busy assertion

The assertBusy lambda calls fail(...) on every iteration where no exception is
thrown, which causes assertBusy to retry — but if the busy assertion itself times
out without ever catching the exception, exRef.get() will return null, causing a
NullPointerException on the subsequent assertTrue calls. You should add a null-check
or assert that exRef.get() is not null before using it.

server/src/internalClusterTest/java/org/opensearch/remotestore/RemoteStoreBackpressureAndResiliencyIT.java [92-101]

 assertBusy(() -> {
     try {
         indexDocAndRefresh(onFailureSource, 1);
     } catch (OpenSearchRejectedExecutionException e) {
         exRef.set(e);
         return;
     }
     fail("Expected OpenSearchRejectedExecutionException to be thrown");
 }, 30, TimeUnit.SECONDS);
 OpenSearchRejectedExecutionException ex = exRef.get();
+assertNotNull("Expected OpenSearchRejectedExecutionException to have been caught", ex);
Suggestion importance[1-10]: 6

__

Why: If assertBusy times out without ever catching the exception, exRef.get() returns null, causing a NullPointerException on the subsequent assertTrue calls. Adding a null-check via assertNotNull would make the failure message clearer and prevent the NPE, though in practice assertBusy would throw its own timeout assertion error first.

Low

Previous suggestions

Suggestions up to commit 0d42f4e
CategorySuggestion                                                                                                                                    Impact
Possible issue
Allow busy-wait loop to retry on failure

The assertBusy lambda calls fail(...) on every iteration where no exception is
thrown, which will immediately abort the busy-wait loop on the first attempt rather
than retrying. The lambda should throw an AssertionError (or use assertTrue(false,
...)) only after the timeout, or restructure so that non-exception iterations simply
retry. Instead, throw a retryable assertion so assertBusy can keep polling until the
condition is met or the timeout expires.

server/src/internalClusterTest/java/org/opensearch/remotestore/RemoteStoreBackpressureAndResiliencyIT.java [92-100]

 assertBusy(() -> {
     try {
         indexDocAndRefresh(onFailureSource, 1);
     } catch (OpenSearchRejectedExecutionException e) {
         exRef.set(e);
         return;
     }
-    fail("Expected OpenSearchRejectedExecutionException to be thrown");
+    throw new AssertionError("OpenSearchRejectedExecutionException not yet thrown, retrying...");
 }, 30, TimeUnit.SECONDS);
Suggestion importance[1-10]: 8

__

Why: This is a valid and important fix. assertBusy works by catching AssertionError to retry; calling fail() throws an AssertionError that assertBusy will catch and retry on, but the semantics are misleading. More critically, using throw new AssertionError(...) makes the retry behavior explicit and correct, ensuring the loop retries until the exception is thrown or the timeout expires rather than potentially confusing the intent.

Medium
Guard against null reference after busy assertion

When assertBusy times out without ever catching the expected exception, exRef.get()
will return null, causing a NullPointerException on the subsequent assertTrue calls.
Add a null-check or assertion after retrieving the value from exRef to provide a
clear failure message in that case.

server/src/internalClusterTest/java/org/opensearch/remotestore/RemoteStoreBackpressureAndResiliencyIT.java [91-101]

 AtomicReference<OpenSearchRejectedExecutionException> exRef = new AtomicReference<>();
 assertBusy(() -> {
     try {
         indexDocAndRefresh(onFailureSource, 1);
     } catch (OpenSearchRejectedExecutionException e) {
         exRef.set(e);
         return;
     }
     fail("Expected OpenSearchRejectedExecutionException to be thrown");
 }, 30, TimeUnit.SECONDS);
 OpenSearchRejectedExecutionException ex = exRef.get();
+assertNotNull("Expected OpenSearchRejectedExecutionException but none was captured", ex);
Suggestion importance[1-10]: 5

__

Why: While technically valid, assertBusy itself throws an AssertionError if the timeout expires without the condition being met (due to fail() calls), so exRef.get() returning null after assertBusy completes successfully is unlikely. The null check adds minor defensive safety but has low practical impact.

Low
Suggestions up to commit 6df872b
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add null check after busy-wait loop

After assertBusy completes, exRef.get() could still be null if assertBusy itself
throws an AssertionError (e.g., timeout) rather than completing normally with a
caught exception. You should add a null check or assert that exRef.get() is not null
before calling methods on it to avoid a NullPointerException.

server/src/internalClusterTest/java/org/opensearch/remotestore/RemoteStoreBackpressureAndResiliencyIT.java [101-102]

 assertBusy(() -> {
     try {
         indexDocAndRefresh(onFailureSource, 1);
     } catch (OpenSearchRejectedExecutionException e) {
         exRef.set(e);
         return;
     }
     fail("Expected OpenSearchRejectedExecutionException to be thrown");
 }, 30, TimeUnit.SECONDS);
 OpenSearchRejectedExecutionException ex = exRef.get();
+assertNotNull("Expected OpenSearchRejectedExecutionException to have been captured", ex);
Suggestion importance[1-10]: 5

__

Why: While technically valid, if assertBusy times out it throws an AssertionError which would propagate before reaching exRef.get(), making the null case unlikely in practice. The suggestion adds a minor safety check but has limited practical impact.

Low

@github-actions
Copy link
Contributor

❌ Gradle check result for 6df872b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

Persistent review updated to latest commit 0d42f4e

@github-actions
Copy link
Contributor

❌ Gradle check result for 0d42f4e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Contributor

❌ Gradle check result for 0d42f4e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

The test assumed that a fixed number of index+refresh iterations during
the failure phase would always accumulate enough time lag to trigger
backpressure rejection. If the test machine is slow then upload time
moving average might be inflated, raising the dynamic threshold
(avgUploadTime * 10) beyond what the time lag can reach in only 3
iterations.

Replace the fixed-iteration assertThrows with an assertBusy loop
that keeps indexing until the backpressure rejection occurs or
the 30-second timeout is reached.

Signed-off-by: Andrew Ross <andrross@amazon.com>
@github-actions
Copy link
Contributor

Persistent review updated to latest commit 0f4a2f1

@github-actions
Copy link
Contributor

❕ Gradle check result for 0f4a2f1: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov
Copy link

codecov bot commented Mar 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@316c070). Learn more about missing BASE report.
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #20869   +/-   ##
=======================================
  Coverage        ?   73.21%           
  Complexity      ?    72468           
=======================================
  Files           ?     5819           
  Lines           ?   331352           
  Branches        ?    47875           
=======================================
  Hits            ?   242587           
  Misses          ?    69258           
  Partials        ?    19507           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@cwperks cwperks merged commit 9142d0e into opensearch-project:main Mar 20, 2026
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut flaky-test Random test failure that succeeds on second run skip-changelog >test-failure Test failure from CI, local build, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for RemoteStoreBackpressureAndResiliencyIT

2 participants